cover text
A Content-Preserving Secure Linguistic Steganography
Xiang, Lingyun, Ou, Chengfu, He, Xu, Yang, Zhongliang, Liu, Yuling
Existing linguistic steganography methods primarily rely on content transformations to conceal secret messages. However, they often cause subtle yet innocent-looking deviations between normal and stego texts, posing potential security risks in real-world applications. To address this challenge, we propose a content-preserving linguistic steganography paradigm for perfectly secure covert communication without modifying the cover text. Based on this paradigm, we introduce CLstega (Content-preserving Linguistic steganography), a novel method that embeds secret messages through controllable distribution transformation. CLstega first applies an augmented masking strategy to locate and mask embedding positions, where the probability distributions predicted by a masked language model (MLM) are easily adjustable for transformation. Subsequently, a dynamic distribution steganographic coding strategy is designed to encode secret messages by deriving target distributions from the original probability distributions. To achieve this transformation, CLstega carefully selects target words for embedding positions as labels to construct a masked sentence dataset, which is used to fine-tune the original MLM, producing a target MLM capable of directly extracting secret messages from the cover text. This approach ensures perfect security of secret messages while fully preserving the integrity of the original cover text. Experimental results show that CLstega can achieve a 100% extraction success rate and outperforms existing methods in security, effectively balancing embedding capacity and security.
Adaptive Data-Resilient Multi-Modal Hierarchical Multi-Label Book Genre Identification
Nareti, Utsav Kumar, Chattopadhyay, Soumi, Mallick, Prolay, Kumar, Suraj, Adak, Chandranath, Daga, Ayush Vikas, Wase, Adarsh, Roy, Arjab
Identifying fine-grained book genres is essential for enhancing user experience through efficient discovery, personalized recommendations, and improved reader engagement. At the same time, it provides publishers and marketers with valuable insights into consumer preferences and emerging market trends. While traditional genre classification methods predominantly rely on textual reviews or content analysis, the integration of additional modalities, such as book covers, blurbs, and metadata, offers richer contextual cues. However, the effectiveness of such multi-modal systems is often hindered by incomplete, noisy, or missing data across modalities. To address this, we propose IMAGINE (Intelligent Multi-modal Adaptive Genre Identification NEtwork), a framework designed to leverage multi-modal data while remaining robust to missing or unreliable information. IMAGINE learns modality-specific feature representations and adaptively prioritizes the most informative sources available at inference time. It further employs a hierarchical classification strategy, grounded in a curated taxonomy of book genres, to capture inter-genre relationships and support multi-label assignments reflective of real-world literary diversity. A key strength of IMAGINE is its adaptability: it maintains high predictive performance even when one modality, such as text or image, is unavailable. We also curated a large-scale hierarchical dataset that structures book genres into multiple levels of granularity, allowing for a more comprehensive evaluation. Experimental results demonstrate that IMAGINE outperformed strong baselines in various settings, with significant gains in scenarios involving incomplete modality-specific data.
TREND: A Whitespace Replacement Information Hiding Method
Hellmeier, Malte, Norkowski, Hendrik, Schrewe, Ernst-Christoph, Qarawlus, Haydar, Howar, Falk
Large Language Models (LLMs) have gained significant popularity in recent years. Differentiating between a text written by a human and a text generated by an LLM has become almost impossible. Information hiding techniques such as digital watermarking or steganography can help by embedding information inside text without being noticed. However, existing techniques, such as linguistic-based or format-based methods, change the semantics or do not work on pure, unformatted text. In this paper, we introduce a novel method for information hiding termed TREND, which is able to conceal any byte-encoded sequence within a cover text. The proposed method is implemented as a multi-platform library using the Kotlin programming language, accompanied by a command-line tool and a web interface provided as examples of usage. By substituting conventional whitespace characters with visually similar Unicode whitespace characters, our proposed scheme preserves the semantics of the cover text without increasing the number of characters. Furthermore, we propose a specified structure for secret messages that enables configurable compression, encryption, hashing, and error correction. Our experimental benchmark on a dataset of one million Wikipedia articles compares ten algorithms from literature and practice. It demonstrates the robustness of our proposed method in various applications while remaining imperceptible to humans. We discuss the limitations in embedding capacity and robustness, which point to directions for future work.
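The whitespace substitution idea in the abstract above can be illustrated with a minimal sketch. It is not the TREND library or its message structure: the one-bit alphabet (ordinary space for 0, U+2009 THIN SPACE for 1), the function names `embed` and `extract`, and the assumption that the cover text contains only ordinary spaces are all choices made here for illustration. Compression, encryption, hashing, and error correction are left out.

```python
PLAIN = "\u0020"  # ordinary space, encodes bit 0 (assumed mapping)
ALT = "\u2009"    # THIN SPACE, encodes bit 1 (assumed mapping)


def embed(cover: str, payload: bytes) -> str:
    """Hide payload bits by swapping spaces; the character count is unchanged."""
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    if len(bits) > cover.count(PLAIN):
        raise ValueError("cover text has too few spaces for this payload")
    remaining = iter(bits)
    out = []
    for ch in cover:
        if ch == PLAIN:
            # Spaces after the payload is exhausted stay ordinary (bit 0).
            out.append(ALT if next(remaining, 0) else PLAIN)
        else:
            out.append(ch)
    return "".join(out)


def extract(stego: str, n_bytes: int) -> bytes:
    """Read back n_bytes from the whitespace characters of the stego text."""
    bits = [1 if ch == ALT else 0 for ch in stego if ch in (PLAIN, ALT)]
    if len(bits) < n_bytes * 8:
        raise ValueError("stego text is too short for the requested length")
    data = bytearray()
    for i in range(0, n_bytes * 8, 8):
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        data.append(value)
    return bytes(data)


cover = "a b c d e f g h i j k l m n o p q r"  # 17 spaces, room for 2 bytes
stego = embed(cover, b"Hi")
recovered = extract(stego, 2)
```

In this sketch the receiver must know the payload length in advance; a structured message header, as the abstract describes, would be one way to remove that requirement.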
Robust Multi-bit Natural Language Watermarking through Invariant Features
Yoo, KiYoon, Ahn, Wonhyuk, Jang, Jiho, Kwak, Nojun
Recent years have witnessed a proliferation of valuable original natural language content in subscription-based media outlets, web novel platforms, and outputs of large language models. However, this content is susceptible to piracy and potential misuse without proper security measures. This calls for a secure watermarking system to guarantee copyright protection through leakage tracing or ownership identification. To effectively combat piracy and protect copyrights, a multi-bit watermarking framework should be able to embed adequate bits of information and extract the watermarks in a robust manner despite possible corruption. In this work, we explore ways to advance both payload and robustness by following a well-known proposition from image watermarking and identify features in natural language that are invariant to minor corruption. Through a systematic analysis of the possible sources of errors, we further propose a corruption-resistant infill model. Our full method improves upon the previous work on robustness by 16.8 percentage points on average across four datasets, three corruption types, and two corruption ratios. Code available at https://github.com/bangawayoo/nlp-watermarking.
Weakly Supervised Annotations for Multi-modal Greeting Cards Dataset
Hanif, Sidra, Latecki, Longin Jan
In recent years, a growing number of pre-trained models have been trained on large corpora and yield good performance on various tasks, such as classifying multimodal datasets. These models perform well on natural images but have not been fully explored for scarce abstract concepts in images. In this work, we introduce an image/text-based dataset called the Greeting Cards Dataset (GCD), which contains abstract visual concepts. We propose to aggregate features from pretrained image and text embeddings to learn abstract visual concepts from GCD. This allows us to learn text-modified image features, which combine complementary and redundant information from the multi-modal data streams into a single, meaningful feature. Secondly, the captions for GCD are computed with a pretrained CLIP-based image captioning model. Finally, we demonstrate that the proposed dataset is also useful for generating greeting card images using a pre-trained text-to-image generation model.
Addressing Segmentation Ambiguity in Neural Linguistic Steganography
Previous studies on neural linguistic steganography, except Ueoka et al. (2021), overlook the fact that the sender must detokenize cover texts to avoid arousing the eavesdropper's suspicion. In this paper, we demonstrate that segmentation ambiguity indeed causes occasional decoding failures at the receiver's side. With the near-ubiquity of subwords, this problem now affects any language. We propose simple tricks to overcome this problem, which are even applicable to languages without explicit word boundaries.
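The decoding failure described above can be reproduced with a toy example. This is only a sketch: the five-entry vocabulary and the greedy longest-match tokenizer `greedy_tokenize` are assumptions made here and are not taken from the paper. The sender emits the subword sequence `un`, `do`, `ne`, but once the text is detokenized the receiver segments the same string differently.

```python
def greedy_tokenize(text: str, vocab: set) -> list:
    """Greedy longest-match segmentation (a stand-in for a real subword tokenizer)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot segment text at position {i}")
    return tokens


vocab = {"un", "do", "undo", "ne", "done"}  # hypothetical subword vocabulary

sent = ["un", "do", "ne"]   # tokens chosen by the sender while embedding bits
text = "".join(sent)        # detokenized cover text that is actually transmitted
received = greedy_tokenize(text, vocab)  # tokens the receiver sees after re-tokenizing
```

Both sequences spell the same string, yet the receiver decodes from a different token sequence than the one the sender encoded, so any bits tied to the individual tokens are lost.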
General Framework for Reversible Data Hiding in Texts Based on Masked Language Modeling
Zheng, Xiaoyan, Fang, Yurun, Wu, Hanzhou
With the fast development of natural language processing, recent advances in information hiding focus on covertly embedding secret information into texts. These algorithms either modify a given cover text or directly generate a text containing secret information. However, they are not reversible, meaning that the original text not carrying secret information cannot be perfectly recovered unless a large amount of side information is shared in advance. To tackle this problem, we propose a general framework to embed secret information into a given cover text, for which the embedded information and the original cover text can be perfectly retrieved from the marked text. The main idea of the proposed method is to use a masked language model to generate a marked text such that the cover text can be reconstructed by collecting the words at some positions, while the words at the other positions can be processed to extract the secret information. Our results show that the secret information can be successfully embedded, and that both the original cover text and the secret information can be successfully extracted. Meanwhile, the marked text carrying secret information has good fluency and semantic quality, indicating that the proposed method has satisfactory security, which has been verified by experimental results. Furthermore, there is no need for the data hider and data receiver to share the language model, which significantly reduces the side information and thus has good potential in applications.
Semantic-Preserving Linguistic Steganography by Pivot Translation and Semantic-Aware Bins Coding
Yang, Tianyu, Wu, Hanzhou, Yi, Biao, Feng, Guorui, Zhang, Xinpeng
Linguistic steganography (LS) aims to embed secret information into a highly encoded text for covert communication. It can be roughly divided into two main categories, i.e., modification-based LS (MLS) and generation-based LS (GLS). Unlike MLS, which hides secret data by slightly modifying a given text without impairing its meaning, GLS uses a trained language model to directly generate a text carrying secret data. A common disadvantage of MLS methods is that the embedding payload is very low, although in return the semantic quality of the text is well preserved. In contrast, GLS allows the data hider to embed a high payload, but at the high price of uncontrollable semantics. In this paper, we propose a novel LS method that modifies a given text by pivoting it between two different languages and embeds secret data by applying a GLS-like information encoding strategy. Our purpose is to alter the expression of the given text, enabling a high payload to be embedded while keeping the semantic information unchanged. Experimental results show that the proposed work not only achieves a high embedding payload, but also shows superior performance in maintaining semantic consistency and resisting linguistic steganalysis.
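As background for the "bins coding" named in the title above, here is a minimal sketch of generic bins coding as used in generation-based LS. It is not the semantic-aware variant the paper proposes: the keyed random partition, the toy vocabulary, the made-up scores, and the helper names `make_bins`, `encode_step`, and `decode_step` are all assumptions for illustration. The vocabulary is split into 2^k bins, each k-bit secret chunk selects a bin, and the sender emits the best-scoring token from that bin.

```python
import random


def make_bins(vocab, bits_per_token, seed):
    """Partition the vocabulary into 2**bits_per_token bins using a shared seed."""
    shuffled = sorted(vocab)
    random.Random(seed).shuffle(shuffled)
    n_bins = 2 ** bits_per_token
    return {token: index % n_bins for index, token in enumerate(shuffled)}


def encode_step(scores, bin_of, chunk):
    """Pick the highest-scoring token whose bin index equals the secret chunk."""
    candidates = [token for token in scores if bin_of[token] == chunk]
    if not candidates:
        raise ValueError("no candidate token falls in the required bin")
    return max(candidates, key=scores.get)


def decode_step(token, bin_of, bits_per_token):
    """Recover the secret chunk as a bit string from the token's bin index."""
    return format(bin_of[token], f"0{bits_per_token}b")


# Toy vocabulary and model scores (hypothetical values, not from any model).
vocab = ["big", "large", "huge", "vast", "great", "giant", "wide", "broad"]
scores = {token: 1.0 / (rank + 1) for rank, token in enumerate(vocab)}

bin_of = make_bins(vocab, bits_per_token=2, seed=42)
chosen = encode_step(scores, bin_of, chunk=0b10)
recovered_bits = decode_step(chosen, bin_of, bits_per_token=2)
```

The receiver needs only the shared partition, not the scores, to recover the bits. Choosing from a restricted bin is also why plain bins coding can drift semantically, which is the trade-off the abstract describes.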